Language- and domain-independent text mining

Author

  • Mari-Sanna Paukkeri
Abstract

Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi
Author: Mari-Sanna Paukkeri
Name of the doctoral dissertation: Language- and domain-independent text mining
Publisher: School of Science
Unit: Department of Information and Computer Science
Series: Aalto University publication series DOCTORAL DISSERTATIONS 137/2012
Field of research: Computer and Information Science
Manuscript submitted: 4 May 2012
Date of the defence: 9 November 2012
Permission to publish granted (date): 11 September 2012
Language: English
Monograph Article dissertation (summary + original articles)

The field of natural language processing (NLP) has developed enormously during the last decades. The availability of a constantly increasing amount of textual data in electronic form has also accelerated the development of statistical methods for NLP, in which characteristics of natural languages are learned from large corpora. Statistical methods have shown their applicability in information retrieval, in which documents of various languages and domains are returned according to user queries; in statistical machine translation, which is easily applicable to new languages; in document clustering, which groups semantically similar documents; and in many information extraction tasks, including keyphrase extraction, document summarization and discovering linguistic features.

However, a majority of NLP research, including many statistical methods, is concentrated on the English language and uses various language-specific tools and resources, such as part-of-speech taggers and ontologies, which are not directly applicable to other languages. Furthermore, methods developed for English alone may not be suitable for languages with a different syntax or writing system. In this dissertation, language-independent methods for natural language processing are developed and discussed. Language-independent methods can be applied to a variety of languages without requiring additional language-specific resources. Dialects, historical forms of languages, languages with few speakers and languages used in specific domains are also accessible with language-independent methods.

As the main contribution of this thesis, Likey, a language-independent method for keyphrase extraction and feature selection, is developed. The method is applied to keyphrase extraction from encyclopedias and scientific articles in eleven languages, and further used as a feature selection method for automatic taxonomy learning and in a novel approach to user modelling in document difficulty assessment. Another major contribution is related to document representations: a set of dimensionality reduction methods and distance measures is compared in a document clustering task, a novel language-independent direct evaluation method for document representations is proposed, and linguistic features are used for document clustering in a lexical choice task.

Similar resources

Presenting a model for extracting information from text documents, based on text mining in the e-learning domain

As computer networks become the backbones of science and the economy, enormous quantities of documents become available. Text mining techniques have therefore been used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts or new hypotheses by automatically extracting information from different written documents. T...

Topic Modeling and Classification of Cyberspace Papers Using Text Mining

The global cyberspace networks provide individuals with platforms to interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and much more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...

A multilingual text mining approach to web cross-lingual text retrieval

To enable concept-based cross-lingual text retrieval (CLTR) using multilingual text mining, our approach will first discover the multilingual concept–term relationships from linguistically diverse textual data relevant to a domain. Second, the multilingual concept–term relationships, in turn, are used to discover the conceptual content of the multilingual text, which is either a document contai...

Semantic Content Access Using Domain-Independent NLP Ontologies

We present a lightweight, user-centred approach for document navigation and analysis that is based on an ontology of text mining results. This allows us to bring the result of existing text mining pipelines directly to end users. Our approach is domain-independent and relies on existing NLP analysis tasks such as automatic multi-document summarization, clustering, question-answering, and opinio...

Painless Labeling with Application to Text Mining

Labeled data is not readily available for many natural language domains, and it typically requires expensive human effort with considerable domain knowledge to produce a set of labeled data. In this paper, we propose a simple unsupervised system that helps us create a labeled resource for categorical data (e.g., a document set) using only fifteen minutes of human input. We utilize the labeled r...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Publication date: 2012